# Replication Package for "Induced Automation Innovation: Evidence from Firm-level Patent Data"
## David Hémous, Morten Olsen, Antoine Dechezleprêtre, and Carlo Zanella, Journal of Political Economy, 2024

Package version: August 2024

## General Instructions
This package contains the data and code needed to replicate the tables and figures from the paper and its online appendix, including intermediate data and some, but not all, of the raw data. All freely available raw data (OECD, BFS, EUklems, etc.) are included. Paid or proprietary patent and firm data from the European Patent Office (EPO), Patstat and Orbis are NOT included. We provide the code for cleaning, correcting and aggregating the excluded data and ask that you obtain the data yourself from the respective sources.

Where intermediate data built from the proprietary sources is sufficiently distinct from them, we include it. We have also compiled aggregated datasets for the figures and tables that would otherwise require the proprietary source data, so that all tables and figures can be rerun inside this package even without the full original data. For more detailed information on individual sources and where to obtain them, please consult the enclosed "Sources.md". Included raw data can be found in the datasets/common-data, /datasets/Wage_data and /datasets/ALM folders.

The general order is as follows, with statistics created along the way:

1/2/3.  Import patstat 2018b
1/2/3.  Run the patent index and classification on EPO patent data
1/2/3.  Import wages, GDP and labor productivity
4.      Apply the EPO classification to patstat
5.      Assemble all of that in the main pipeline to the various datasets used in the analysis. This branches off a couple of times depending on the analysis. 
        But generally:

            5.1 Patent weights
            5.2 Stocks
            5.3 Spillovers
            5.4 LHS variables
            5.5 RHS variables

6.      ALM reproduction
7.      Montecarlo- and macro-simulations
8.      Output tables and figures

The package consists of three types of code: PowerShell, Python and Stata. Python is used for indexing and classifying the patent data from the EPO; Stata is used for applying the resulting classification to Patstat 2018 data and for constructing all further data and outputs. PowerShell calls the Python and Stata code and guides you through the replication. Caution: Windows PowerShell, which we used, only runs on Windows. If you are on a Unix-based system (Linux, macOS), we advise either writing your own shell equivalents or moving to a Windows machine. Emulation will make the already long runtime for the whole project, especially the simulations, even longer.

The project consists of multiple subpackages. All Stata code is contained in the /code directory. It is separated into the following:

- "assembly" creates all datasets in the main pipeline.
    - "ALM" does all data construction for the reproduction of the results by Autor et al. (2003).
    - "Measures of Wages" creates the initial panel of wages by skill level, GDP, VA, etc., corrected for exchange rates and inflation.
    - "Timmer" creates a dataset that includes our offshoring data, derived from results and data by Timmer et al. (2014).
    - "patstat" imports the tls files from Patstat and cleans/imputes where necessary.
    - "classification" contains all Stata code used in the classification.
- "config" contains settings for variables, paths and the output of tables and figures.
- "macrosim" contains Stata and PowerShell code for the macrosimulation (Figure 4 and Table A43).
- "montecarlo" contains code for the three versions of the Monte Carlo simulation.
- "tables" contains all Stata code for the tables.
- "figures" contains all code for producing the figures.

The Python code is located in the /classification directory, as it works as a separate package that can be reused for other projects. The two main PowerShell files are located in the root directory and the classification directory respectively and *can* be run independently if so desired, provided you have obtained the EPO patent data (and/or the Patstat data). Be aware that the Stata code for applying the classification to Patstat 2018 is still located with the rest of the DHOZ construction in the /code folder. Both main PowerShell files manage software and version dependencies as we ran them; we do not guarantee that the code will work with newer versions.

All processed data used in the tables and figures, save for the statistics on classification, are located in the datasets/final_data folder. Tables and figures are written to their respective directories as tex files and then compiled to PDFs. We manually combined (or, in the case of the simulation, shortened) a handful of tables for the paper. Numerical figures quoted in the text can be found in the /numbers directory, where they are saved during construction, though some were computed by hand for the paper itself.

## Before Running 

### EPO data
Make sure the external drive/directory with EPO data is connected and that you have the path on hand. If you do not have the EPO data, the script will allow you to skip files that need it.

### Patstat data
Put the required files into the /datasets/common_data/patstat_raw folder and make sure they are unzipped (we imported them as .txt files). You will need all parts of the files with the following tls codes:

- 201 | 204 | 207 | 209
- 211 | 212 | 216
- 224 | 228 | 230 | 231
- 801 | 901 | 906

If you do not have the patstat data, the script will allow you to skip files that need it.
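
Before launching the import, it can help to confirm that every required table is present. The following is a minimal Python sketch, not part of the package. It assumes that the Patstat file names start with `tls` followed by the three-digit code (for example `tls201_part01.txt`); the function name `missing_tls_tables` is hypothetical. Adjust the pattern if your files are named differently.

```python
import re
from pathlib import Path

# tls codes listed in this README.
REQUIRED_TLS = {
    "201", "204", "207", "209",
    "211", "212", "216",
    "224", "228", "230", "231",
    "801", "901", "906",
}


def missing_tls_tables(raw_dir, required=REQUIRED_TLS):
    """Return the sorted tls codes for which no .txt file exists in raw_dir."""
    raw_dir = Path(raw_dir)
    if not raw_dir.is_dir():
        # Nothing can be found in a folder that does not exist.
        return sorted(required)
    # Assumption: file names look like "tls201_part01.txt" or "tls201.txt".
    pattern = re.compile(r"^tls(\d{3})", re.IGNORECASE)
    found = set()
    for path in raw_dir.glob("*.txt"):
        match = pattern.match(path.name)
        if match:
            found.add(match.group(1))
    return sorted(set(required) - found)


if __name__ == "__main__":
    missing = missing_tls_tables("datasets/common_data/patstat_raw")
    print("Missing tls tables:", ", ".join(missing) if missing else "none")
```

The check only looks for at least one file per code. It cannot tell whether all parts of a table have been copied.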

### Orbis data
If you do not have this data, the script will allow you to skip the code that uses it. If you have it, place the corresponding txt files in the ./datasets/orbis_patents/utf8 folder, and place Orbis_patents_2017_company_names.txt and Orbis_patents_2017_DUO_GUO.csv in the orbis_patents folder. The NACE BvD crosswalk goes into the nace_orbis_2017 folder, and all other NACE files (GUO/DUO) go into the main orbis_patents folder.

### Mann-Puettmann data
Put the files into ./datasets/common_data/mann_puetmann. We import them as .csv files.

### Software
- Make sure PowerShell is installed. We used version 5.1.19041.4648.
- Make sure Python 3.11 is installed if you want to run the classification or Orbis sections.
    - Both will launch a pipenv to manage dependencies.
- Make sure you have a LaTeX distribution installed. We assume MiKTeX is installed and check for it. If PowerShell cannot find a LaTeX distribution, it will prompt you for a path.
- Make sure Stata 18 is installed. PowerShell will ask for the executable. The default Windows location is prespecified. Keep your path handy if you installed Stata elsewhere.
    - We use 'require' to manage Stata packages. It will be installed automatically, as will any packages listed in the 'requirements.txt' file.
- If you want to run the offshoring (Timmer et al.) part, make sure MATLAB is installed. We last used R2023a.

### Hardware
The package is quite demanding. We advise at least 24 GB of RAM and making sure that the location of your Stata temporary files has ample free space beyond that (see https://www.stata.com/support/faqs/data-management/statatmp-environment-variable/ for more information). Should you want to run the macrosimulations, we advise **at least** 12 cores: Stata multi-threads on up to 4 cores per instance, and the macrosimulation runs up to three instances simultaneously.
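
The core recommendation follows from three parallel Stata instances with up to four threads each. A minimal Python sketch of that arithmetic, with a check against the current machine, is given below. It is not part of the package, and the function names are hypothetical. Note that `os.cpu_count()` reports logical cores, which may be more than the number of physical cores.

```python
import os

# Values taken from the hardware notes in this README.
STATA_THREADS_PER_INSTANCE = 4
PARALLEL_INSTANCES = 3


def cores_needed(instances=PARALLEL_INSTANCES, threads=STATA_THREADS_PER_INSTANCE):
    """Cores required so that every parallel Stata instance can use all its threads."""
    return instances * threads


def has_enough_cores(available, needed):
    """True if the reported core count covers the requirement.

    os.cpu_count() can return None when the count is unknown; treat that as "no".
    """
    return available is not None and available >= needed


if __name__ == "__main__":
    needed = cores_needed()
    available = os.cpu_count()  # logical cores, possibly None
    status = "enough" if has_enough_cores(available, needed) else "fewer than advised"
    print(f"Advised cores: {needed}, reported logical cores: {available} ({status})")
```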

## How to run
0. Make sure all files are unzipped, particularly the following ones, which we had to compress due to the JPE's file size limit:

    - ./datasets/final_data/famili_timeseries_full.dta
    - ./datasets/final_data/regression_dataset_from1970_tfacit1.dta
    - ./datasets/final_data/regression_dataset_from1970_pauto95.dta
    - ./datasets/final_data/regression_dataset_from1970_GDP0.dta
    - ./datasets/final_data/regression_dataset_from1970_GDP1.dta
    - ./datasets/dep_vars/bvd_year_depvars.dta
    - ./datasets/indep_vars/bvd_year_indepvars_sharesgdpweighted_excluding_final_tfacit1.dta

JPE also changes csv files to .tab files. We do not use .tab files. If you see any, convert them back or you will get errors.
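
Both points can be checked before launching the main script. The following Python sketch is not part of the package and its function names are hypothetical. It reports which of the files listed above are still missing from the package root and lists any leftover .tab files. It does not convert anything.

```python
from pathlib import Path

# Files this README lists as compressed for upload (relative to the package root).
COMPRESSED_FILES = [
    "datasets/final_data/famili_timeseries_full.dta",
    "datasets/final_data/regression_dataset_from1970_tfacit1.dta",
    "datasets/final_data/regression_dataset_from1970_pauto95.dta",
    "datasets/final_data/regression_dataset_from1970_GDP0.dta",
    "datasets/final_data/regression_dataset_from1970_GDP1.dta",
    "datasets/dep_vars/bvd_year_depvars.dta",
    "datasets/indep_vars/bvd_year_indepvars_sharesgdpweighted_excluding_final_tfacit1.dta",
]


def missing_files(root, relative_paths=COMPRESSED_FILES):
    """Return the listed files that do not yet exist under root."""
    root = Path(root)
    return [rel for rel in relative_paths if not (root / rel).is_file()]


def find_tab_files(root):
    """Return all .tab files below root, sorted by path."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.tab"))


if __name__ == "__main__":
    for rel in missing_files("."):
        print("Not unzipped yet:", rel)
    for path in find_tab_files("."):
        print("Needs converting back:", path)
```

Run it from the package root. No output means that all listed files are in place and no .tab files were found.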

1. Launch powershell
2. Set the working directory to the ./DHOZ_replication_jpe folder using cd. E.g. "cd C:\Users\user\Downloads\DHOZ_JPE"
3. Launch the main Powershell file using ./master_DHOZ.ps1
4. Follow the instructions in the powershell window. It will prompt you a couple of times for paths and choices regarding which parts to execute, available proprietary data and install dependencies.
5. Runtimes are quite long. We have kept the iteration count for the simulations the same as in the paper.
    - Tables and figures only: 2-3 hours, depending on your machine.
    - Main assembly pipeline with proprietary data, excluding classification and patstat: ~48 hours.
    - Monte Carlo simulations: ~96 hours each (3x).
    - Macrosimulations: ~72 hours.
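
To plan machine time, the estimates above can be added up for the parts you intend to run. This is a rough Python sketch, not part of the package. It assumes that the parts run one after another, uses the upper bound of 3 hours for tables and figures, and leaves out classification and the Patstat import, for which no estimate is given. The dictionary keys are illustrative labels only.

```python
# Approximate runtimes in hours, taken from the list above.
RUNTIME_HOURS = {
    "tables_and_figures": 3,        # upper bound of the 2-3 hour range
    "main_assembly": 48,
    "montecarlo": 3 * 96,           # three versions, ~96 hours each
    "macrosimulations": 72,
}


def total_hours(selected):
    """Sum the estimated runtime of the selected parts, assuming sequential runs."""
    return sum(RUNTIME_HOURS[name] for name in selected)


if __name__ == "__main__":
    hours = total_hours(RUNTIME_HOURS)  # iterating a dict yields its keys
    print(f"All listed parts: ~{hours} hours (~{hours / 24:.1f} days)")
```

Under these assumptions, the listed parts together take about 411 hours, or roughly 17 days.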


## Error handling
While we have tested extensively, code can always fail. Logs are in the corresponding directory. Should a file in the assembly fail to execute, powershell will terminate the script and give you an error message with the corresponding failed file to avoid running into more errors down the line. We save the name of the failed file (lastrun.txt), so you may resume from that point once you have resolved the error. When tables or figures fail, you will get an error but it will not terminate the script. If you have a slow disk, you may want to increase the sleep timers in the stata code preemptively to avoid read/write errors.
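
The resume behaviour can be pictured as follows. This Python sketch only illustrates the idea and is not the package's PowerShell implementation. It assumes that lastrun.txt holds a single file name on one line, and the step names in the usage example are hypothetical.

```python
from pathlib import Path


def remaining_steps(ordered_steps, lastrun_path):
    """Return the steps still to run, starting at the file named in lastrun.txt.

    If lastrun.txt is absent or names an unknown file, all steps are returned.
    """
    steps = list(ordered_steps)
    path = Path(lastrun_path)
    if not path.is_file():
        return steps
    # Assumption: the file contains just the name of the failed file.
    failed = path.read_text(encoding="utf-8").strip()
    if failed in steps:
        return steps[steps.index(failed):]
    return steps


if __name__ == "__main__":
    # Hypothetical step names, for illustration only.
    pipeline = ["weights.do", "stocks.do", "spillovers.do"]
    print(remaining_steps(pipeline, "lastrun.txt"))
```

The failed file itself is included in the result, since it has to be rerun once the error is fixed.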